Conversation

@pkoutsovasilis (Contributor) commented Oct 17, 2025

What does this PR do?

This PR fixes zombie/defunct processes that are left behind when Elastic Agent re-executes itself during restart. The fix involves:

  1. Decreasing the EDOT collector shutdown timeout from 30 seconds to 3 seconds so that it completes within the coordinator's default 5-second shutdown timeout
    • Adding a safety net that waits one additional second after killing a process to ensure Wait() is called and the process is reaped (see the sketch after this list)
  2. Improving graceful shutdown handling in the EDOT collector subprocess manager to ensure proper process cleanup
  3. Adding debug logging throughout the shutdown process to better trace subprocess termination
  4. Adding an integration test that verifies no zombie processes are left behind after agent restart
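
For illustration, here is a minimal sketch of the kill-then-reap pattern described above; the names (stopSubprocess, collectorShutdownTimeout, killWaitGrace, requestStop) are hypothetical and not the agent's actual subprocess manager API:

package main

import (
	"context"
	"fmt"
	"os/exec"
	"time"
)

const (
	// collectorShutdownTimeout must stay below the coordinator's 5-second shutdown timeout.
	collectorShutdownTimeout = 3 * time.Second
	// killWaitGrace is the extra second waited after Kill so Wait() still reaps the child.
	killWaitGrace = 1 * time.Second
)

// stopSubprocess asks an already-started child to stop gracefully; if it does not
// exit within collectorShutdownTimeout it kills the child, but in every path it
// waits for the child so the exit status is reaped and no zombie is left behind.
func stopSubprocess(ctx context.Context, cmd *exec.Cmd, requestStop func()) error {
	waitErr := make(chan error, 1)
	go func() { waitErr <- cmd.Wait() }()

	requestStop() // e.g. signal the collector to shut down

	select {
	case err := <-waitErr:
		return err // graceful exit, already reaped by Wait()
	case <-time.After(collectorShutdownTimeout):
		_ = cmd.Process.Kill() // graceful window elapsed, kill as a safety net
	case <-ctx.Done():
		_ = cmd.Process.Kill() // caller gave up, kill just in case
	}

	// Give the pending Wait() a short grace period so the killed child is reaped.
	select {
	case err := <-waitErr:
		return err
	case <-time.After(killWaitGrace):
		return fmt.Errorf("subprocess was killed but could not be reaped in time")
	}
}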

Why is it important?

Root Cause

When the Elastic Agent re-executes itself during restart, the following sequence occurs:

  1. If a subprocess (particularly the EDOT collector or command components) takes longer than the coordinator's 5-second shutdown timeout, the agent proceeds to execve itself
  2. During execve, all threads other than the calling thread are destroyed
  3. This triggers the PDeathSig mechanism we enable for subprocesses
  4. However, the parent process (the pre-execve Elastic Agent) never reaps (waits for) the exit status of the spawned subprocesses (see the sketch after this list)
  5. Result: these subprocesses end up as defunct/zombie processes
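
For context, a minimal Linux-only sketch (not the agent's actual code) of the PDeathSig setup and of why the parent still has to call Wait() to reap the child:

package main

import (
	"os/exec"
	"syscall"
)

func main() {
	cmd := exec.Command("sleep", "60")
	cmd.SysProcAttr = &syscall.SysProcAttr{
		// Deliver SIGKILL to the child when the thread that spawned it dies,
		// which is exactly what happens to non-calling threads during execve.
		Pdeathsig: syscall.SIGKILL,
	}
	if err := cmd.Start(); err != nil {
		panic(err)
	}

	// Simulate the forced stop: kill the child, then reap it. If the parent
	// never reaches Wait() (as with the pre-execve agent for slow-stopping
	// children), the dead child stays in the process table as <defunct>.
	_ = cmd.Process.Kill()
	_ = cmd.Wait() // reaps the exit status so no zombie is left behind
}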

Why This Affects EDOT More Than Beats

Beats subprocesses typically terminate almost immediately (within the 5-second window), so they don't become zombies. However, the EDOT collector's shutdown time seemed to be affected by:

  • Number of pipeline workers
  • Elasticsearch exporter configuration

Impact

  • Resource leaks: Zombie processes consume PIDs and kernel memory
  • Operational issues: Accumulation of zombies over multiple restarts
  • Config update delays: EDOT subprocess restarts on every config change, and 20+ second shutdowns create significant latency

This fix ensures proper process cleanup regardless of shutdown duration while maintaining graceful termination when possible.

Checklist

  • I have read and understood the pull request guidelines of this project.
  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have made corresponding changes to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in ./changelog/fragments using the changelog tool
  • I have added an integration test or an E2E test

Disruptive User Impact

Users may notice:

  • Agent restarts can take longer (up to 35 seconds instead of 5 seconds in the worst case)
  • However, this ensures clean shutdowns and prevents zombie accumulation
  • The tradeoff is worthwhile as zombie processes can cause operational issues over time

How to test this PR locally

Run the TestMetricsMonitoringCorrectBinaries integration test.

Related issues

  • [beats receivers] Defunct elastic-agent otel --supervised process left behind when Elastic Agent re-executes itself

@pkoutsovasilis pkoutsovasilis self-assigned this Oct 17, 2025
@pkoutsovasilis pkoutsovasilis added the Team:Elastic-Agent-Control-Plane, skip-changelog, backport-8.19, backport-9.1, and backport-9.2 labels on Oct 17, 2025
@pkoutsovasilis pkoutsovasilis force-pushed the fix/cordinator_timeout branch 2 times, most recently from 6186951 to a32c16f on October 21, 2025 10:11
@pkoutsovasilis pkoutsovasilis marked this pull request as ready for review October 21, 2025 10:32
@pkoutsovasilis pkoutsovasilis requested a review from a team as a code owner October 21, 2025 10:32
@elasticmachine (Collaborator)

Pinging @elastic/elastic-agent-control-plane (Team:Elastic-Agent-Control-Plane)

@swiatekm swiatekm (Contributor) left a comment

LGTM, some relatively minor nitpicks.

@swiatekm swiatekm self-requested a review October 21, 2025 14:33
@swiatekm swiatekm previously approved these changes Oct 21, 2025
@pkoutsovasilis pkoutsovasilis force-pushed the fix/cordinator_timeout branch 2 times, most recently from 2cae2b1 to b37db6c on October 23, 2025 07:42
@michalpristas michalpristas (Contributor) left a comment

small nits, but it looks good, will approve after green CI

cfg.Settings.Collector,
monitor.ComponentMonitoringConfig,
cfg.Settings.ProcessConfig.StopTimeout,
3*time.Second, // this needs to be shorter than 5 * time.Seconds (coordinator.managerShutdownTimeout) otherwise we might end up with defunct processes
Contributor

worth making it a const with a comment
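
A hypothetical shape for that suggestion (the const name is illustrative, not necessarily what the PR ended up using):

// collectorShutdownTimeout must stay below coordinator.managerShutdownTimeout (5s),
// otherwise re-exec can leave the supervised collector behind as a defunct process.
const collectorShutdownTimeout = 3 * time.Second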

Contributor Author

changed in f66af91

s.log.Warnf("timeout waiting (%s) for the supervised collector to stop, killing it", waitTime.String())
// our caller ctx is Done; kill the process just in case
_ = s.processInfo.Kill()
case <-time.After(1 * time.Second):
Contributor

So the worst case is waitTime + 1s. Please update the func docs.
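
A possible shape for such a doc comment (illustrative only; the function name is hypothetical):

// Stop shuts down the supervised collector gracefully. In the worst case it
// returns after waitTime plus one extra second spent waiting for the killed
// process to be reaped.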

Contributor Author

changed in f66af91

@elasticmachine (Collaborator)

💛 Build succeeded, but was flaky

Failed CI Steps

History

cc @pkoutsovasilis

Development

Successfully merging this pull request may close these issues.

[beats receivers] Defunct elastic-agent otel --supervised process left behind when Elastic Agent re-executes itself
